Hash Table Sizes for Storing N-Grams for Text Processing

نویسندگان

  • Zhong Gu
  • Daniel Berleant
چکیده

N-grams have been widely investigated for a number of text processing tasks. However n-gram based systems often labor under the large memory requirements of naïve storage of the large vectors that describe the many n-grams that could potentially appear in documents. This problem becomes more severe as the number of documents (and hence the number of vectors to store and process) rises. A natural approach to reducing vector size is to hash the large number of possible n-grams into a smaller vector. We address this problem by identifying good and bad hash table sizes over a wide range of sizes. We show that English, French, and German n-grams behave similarly when hashed, and that this is unlike the behavior of randomly generated n-grams. Therefore the difference in behavior is due to properties of the languages themselves. We then investigate different table sizes and identify which sizes are particularly good when hashing ngrams during processing of these languages.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

1 Hash Table Sizes for Storing N - Grams for Text Processing

N-grams have been widely investigated for a number of text processing tasks. However n-gram based systems often labor under the large memory requirements of naïve storage of the large vectors that describe the many n-grams that could potentially appear in documents. This problem becomes more severe as the number of documents (and hence the number of vectors to store and process) rises. A natura...

متن کامل

Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors’ rights, plagiarism detection, is one of the critical problems in the field of text-mining that many researchers are interested in. This issue is considered as a serious one in high academic institutions. There exist language-free tools which do not yield any reliable results since the special features of every language are ignored in them. Considering the paucit...

متن کامل

Efficient In-memory Data Structures for n-grams Indexing

Indexing n-gram phrases from text has many practical applications. Plagiarism detection, comparison of DNA of sequence or spam detection. In this paper we describe several data structures like hash table or B+ tree that could store n-grams for searching. We perform tests that shows their advantages and disadvantages. One of neglected data structure for this purpose, ternary search tree, is deep...

متن کامل

One-Pass, One-Hash n-Gram Statistics Estimation

In multimedia, text or bioinformatics databases, applications query sequences of n consecutive symbols called n-grams. Estimating the number of distinct n-grams is a view-size estimation problem. While view sizes can be estimated by sampling under statistical assumptions, we desire an unassuming algorithm with universally valid accuracy bounds. Most related work has focused on repeatedly hashin...

متن کامل

Efficient Minimal Perfect Hash Language Models

The recent availability of large collections of text such as the Google 1T 5-gram corpus (Brants and Franz, 2006) and the Gigaword corpus of newswire (Graff, 2003) have made it possible to build language models that incorporate counts of billions of n-grams. This paper proposes two new methods of efficiently storing large language models that allow O(1) random access and use significantly less ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003